The Movie Database (TMDb) is a community-built movie and TV database. Every piece of data has been added by an amazing community dating back to 2008. TMDb's strong international focus and breadth of data are largely unmatched, and the project lives and breathes community — that's precisely what makes it different. Using an extensive dataset acquired from this source (TMDb), we will analyze the data and look for valuable insights.
We begin by importing all the necessary dependencies for this project. We'll need the Pandas, NumPy, Matplotlib and Seaborn libraries to accomplish our aims in this project.
# Use this cell to set up import statements for all of the packages that you
# plan to use.
# Remember to include a 'magic word' so that your visualizations are plotted
# inline with the notebook. See this page for more:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.animation as animation
import seaborn as sns
from IPython.display import HTML
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In this section of the report, we load the data, evaluate its cleanliness, and determine whether any cleaning is necessary before we proceed with our analysis.
Step One: Here, we check properties such as measures of centre and spread (mean, standard deviation, variance), minimum and maximum values, the shape of the data, and the number of unique values in our dataset.
Findings: Here, we got preliminary information about our dataset. As we proceed, we will base our decisions on this information.
# Load your data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
df = pd.read_csv('tmdb-movies.csv')
df.head()
print ('This is the number of rows and columns in this table:', df.shape)
df.nunique()
df.describe()
Step Two: Here, we check for missing values and decide whether to drop missing fields or fill missing fields. From the data, we also evaluate the best approach to do this.
Findings: Here, we found that all missing values occur in object (string) columns. This means that no numeric column contains missing values.
print(df.isna().sum())
print(' ')
print(' ')
print(df.dtypes)
Step Three: Here, we check for duplicate values in our dataset then we evaluate the best approach to eliminating the duplicates.
Findings: We found that only one row exists as a duplicate.
print ('The total number of duplicates in our dataset is:' , df.duplicated().sum())
Decision: Here, we decided to fill some missing fields with placeholders and drop other columns with missing values (e.g. production_companies, homepage, tagline, overview and cast) that are unnecessary for our analysis. We decided to do this for two main reasons.
Reason 1: Because all our missing values are in string/object-type fields, dropping the affected rows would also remove all the numeric data attached to them. This is not good.
Reason 2: Because the proportion of rows affected by missing values is significant (above 30% of the entire dataset), dropping every row associated with a missing value would greatly impact our analysis. This is also not good.
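The "above 30%" figure in Reason 2 can be checked directly with pandas. A minimal sketch on a toy frame — the column names are borrowed from the dataset, but the values here are invented for illustration:

```python
import pandas as pd

# Toy frame standing in for the TMDb data: two object columns with gaps.
toy = pd.DataFrame({
    'tagline': ['A', None, None, 'D', None],
    'homepage': ['h', None, 'x', 'y', None],
    'revenue': [1, 2, 3, 4, 5],
})

# Fraction of rows with at least one missing value.
frac_missing = toy.isna().any(axis=1).mean()
print(f'{frac_missing:.0%} of rows contain a missing value')

# Per-column missing share, to decide fill vs. drop column by column.
print((toy.isna().mean() * 100).round(1))
```

Running the same two expressions on the real dataframe gives the row-level and column-level missing percentages that motivated the decision above.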
# Removing the following columns
# homepage
df.drop('homepage', axis=1, inplace=True)
# confirming whether the homepage column has been dropped from the table
# number of columns should be 20 (i.e. 21 - 1 = 20)
print (df.shape)
print (' ')
print (df.isna().sum())
# dropping remaining columns - production_companies, tagline, cast, overview
df.drop(['cast', 'tagline', 'overview', 'production_companies'] , axis=1, inplace=True)
# confirming whether remaining columns have been dropped from table
# number of columns should be 16 (i.e. 20 - 4 = 16 columns)
print (df.shape)
print (' ')
print (df.isna().sum())
# including the keywords column in the list of columns to be dropped
df.drop(columns='keywords',inplace=True)
# then checking to confirm removal of the column
print (df.shape)
print (df.isnull().sum())
# time to fill the 3 remaining columns with missing values - imdb_id, director and genres
# replacing missing values with placeholder text
df['imdb_id'] = df['imdb_id'].fillna('No ID')
df['genres'] = df['genres'].fillna('No Genres')
df['director'] = df['director'].fillna('No Director')
#confirming whether missing values have been filled.
# This means, the sum of missing values (df.isnull().sum()) should all be zero.
df.isnull().sum()
# last but not least, it's time to drop the one and only duplicated row
df.drop_duplicates(inplace=True)
#Now, number of duplicates in our data should equal to zero.
df.duplicated().sum()
# now we show a distribution of our final data before we go into exploration and answering our questions
df.hist(figsize = (15,15));
We can see from the distributions above that the majority of our numeric columns are right-skewed. Specifically, "budget", "budget_adj", "popularity", "revenue", "revenue_adj" and "vote_count" all have long right tails, with most values small and a few very large ones.
The "vote_average" distribution is almost normal, and "release_year" is left-skewed, which tells us that the number of movies released increased over the years.
In this section, we compute statistics and create visualizations with the goal of addressing the research questions posed in the Introduction. Our goal is to approach this systematically: look at one variable at a time, and then follow up by looking at relationships between variables.
To answer this question, we'll take the following two steps:
Step One: We separate the genre names into distinct genres and find the total number of genres.
Step Two: We make a chart that compares the popularity of each distinct genre.
genres = df.genres.str.split('|', expand=True).stack().value_counts().index
print("Number of genres is {}".format(genres.size))
We have 20 genres overall. Let's create a color map for them, so that every genre gets a unique color. Choosing colors is a complicated task, so we'll use the built-in Matplotlib "tab20" colormap, which has exactly 20 colors in a good-looking palette.
colors_map = {}
cm = plt.get_cmap('tab20')
# we have 20 colors in the [0-1] range,
# so start from 0.025 and add 0.05 every cycle;
# this way we get a different color for every genre
off = 0.025
for genre in genres:
    colors_map[genre] = cm(off)
    off += 0.05
Let's create a function that returns a sorted dataframe relating the entries of a multi-value column (such as genres) to the average of a single-value column. This will help us analyse all the multi-value columns.
def get_mdepend(df, multival_col, qual_col):
    # split the column by the '|' character and stack
    split_stack = df[multival_col].str.split('|', expand=True).stack()
    # convert the series to a frame
    split_frame = split_stack.to_frame(name=multival_col)
    # drop the unneeded inner index level
    split_frame.index = split_frame.index.droplevel(1)
    # add qual_col, group and find the average
    dep = split_frame.join(df[qual_col]).groupby(multival_col).mean()
    # return the sorted dependency
    return dep.sort_values(qual_col)
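As a quick sanity check, `get_mdepend` can be exercised on a tiny hypothetical frame (the definition is repeated here so the cell runs standalone; the movie data below is invented):

```python
import pandas as pd

# repeated here so this cell runs standalone
def get_mdepend(df, multival_col, qual_col):
    split_stack = df[multival_col].str.split('|', expand=True).stack()
    split_frame = split_stack.to_frame(name=multival_col)
    split_frame.index = split_frame.index.droplevel(1)
    dep = split_frame.join(df[qual_col]).groupby(multival_col).mean()
    return dep.sort_values(qual_col)

# hypothetical mini-dataset: two movies share the 'Drama' genre
toy = pd.DataFrame({
    'genres': ['Action|Drama', 'Drama', 'Comedy'],
    'popularity': [2.0, 4.0, 1.0],
})

# 'Drama' averages (2.0 + 4.0) / 2 = 3.0; the result is sorted ascending
print(get_mdepend(toy, 'genres', 'popularity'))
```

Note that the movie with two genres contributes its popularity to both of them, which is exactly the behavior we want when averaging per genre.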
Next we'll create a function that plots our horizontal bar chart with the popularity of movies for all genres up to the desired year.
def draw_barchart_frame(current_year):
    # get data only up to current_year
    dep = get_mdepend(df.query('release_year <= {}'.format(current_year)),
                      'genres', 'popularity')
    # clear before drawing
    ax.clear()
    # plot a horizontal bar chart using our colormap
    ax.barh(dep.index,
            dep['popularity'].tolist(),
            color=[colors_map[x] for x in dep.index])
    # small horizontal offset for the labels
    dx = dep['popularity'].max() / 200
    for i, (value, name) in enumerate(zip(dep['popularity'].tolist(),
                                          dep.index)):
        # genre name
        ax.text(value - dx, i, name,
                size=14, weight=600, ha='right', va='center')
        # genre value
        ax.text(value + dx, i, f'{value:,.2f}',
                size=14, ha='left', va='center')
    # big current year
    ax.text(1, 0.2, current_year,
            transform=ax.transAxes, color='#777777',
            size=46, ha='right', weight=800)
    # caption for the ticks
    ax.text(0, 1.065, 'Popularity',
            transform=ax.transAxes, size=14, color='#777777')
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.1f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='x', colors='#777777', labelsize=12)
    ax.set_yticks([])
    ax.margins(0, 0.01)
    ax.grid(which='major', axis='x', linestyle='-')
    ax.set_axisbelow(True)
    # chart caption
    ax.text(0, 1.16, 'Popular Genres from 1960 to 2015',
            transform=ax.transAxes, size=24, weight=600,
            ha='left', va='top')
Finally we'll create an animation.
#create figure
fig, ax = plt.subplots(figsize=(10, 7))
#remove borders
plt.box(False)
#immediately close it to not provide additional figure
#after animation block
plt.close()
animator = animation.FuncAnimation(fig,
draw_barchart_frame,
frames=range(1960, 2016),
interval=666)
#add space before animation
print('')
HTML(animator.to_jshtml())
To answer question number 2, let's start by finding any correlation between movie revenue and other properties. For this we'll make use of the "revenue_adj" (Adjusted Revenue).
# Continue to explore the data to address your additional research
# questions. Add more headers as needed if you have more questions to
# investigate.
df.head()
sns.pairplot(data=df,
x_vars=['popularity', 'budget', 'runtime'],
y_vars=['revenue_adj'],
kind='reg');
sns.pairplot(data=df,
x_vars=['vote_count', 'vote_average', 'release_year'],
y_vars=['revenue_adj'],
kind='reg');
From the results above, we can see that...
"popularity" and "vote_count" have a positive correlation with revenue. This makes sense, because the more people who watch a movie, the more revenue it gets.
"budget" has a smaller positive correlation with revenue. This suggests that higher investment in a movie tends to result in higher revenue.
"vote_average" has a weaker positive correlation with revenue. This is somewhat surprising: one would expect a movie with a higher average vote to attract more viewers and hence earn more revenue, but that is not what the data shows here.
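The pairplots give a visual impression only; the strength of each relationship can be quantified with `DataFrame.corr` (Pearson by default). A minimal sketch on a hypothetical mini-sample — the column names match the notebook, but the numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical mini-sample with the notebook's column names; values invented.
toy = pd.DataFrame({
    'popularity':   [0.5, 1.2, 3.0, 8.0, 10.0],
    'budget':       [1e6, 5e6, 2e7, 1e8, 1.5e8],
    'vote_average': [5.0, 6.1, 5.8, 7.0, 6.5],
    'revenue_adj':  [2e6, 1e7, 6e7, 4e8, 7e8],
})

# Pairwise Pearson correlation of each property with adjusted revenue.
print(toy.corr()['revenue_adj'].drop('revenue_adj').round(2))
```

Running `df.corr(numeric_only=True)['revenue_adj']` on the real dataframe puts a single coefficient behind each of the qualitative statements above.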
Finally we can conclude on the following from our findings:
Research Question 1: our analysis of movie genres versus popularity from 1960 to 2015 shows that the "Thriller" genre was the most popular in 1960, while the "Adventure" genre was the most popular in 2015.
Research Question 2: our analysis has led us to conclude that the most common properties of high-grossing movies are a massive "popularity", a significantly large "budget" and a high "vote_count".